1 Overview

Multiple regression is one of the most popular methods in statistics as well as in machine learning. We use the linear model as a working model for its simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the predictors. When there are many possible features to include in the working model, it is inevitable that we need to choose the best possible model using a sensible criterion. Cp, BIC and regularization methods such as LASSO are introduced. Be aware that whether model selection is done formally or informally, the inferences obtained from the final lm() fit may not be valid, and some adjustment will be needed. This last step is beyond the scope of this class; check the current line of research that Linda and collaborators are working on.

This homework consists of two parts. The first is an exercise (it will feel like a toy example after the COVID case study) to get familiar with model selection tools such as Cp and BIC. The main job is a rather involved case study of the devastating COVID-19 pandemic. Please read through the case study first. This project would certainly be a great one to list on your CV.

For the COVID case study, most of the time and effort will be needed in the EDA portion.

1.1 Objectives

  • Model building process

  • Methods

    • Model selection
      • All subsets
      • Forward/Backward
    • Regularization
      • LASSO (L1 penalty)
      • Ridge (L2 penalty)
      • Elastic net
  • Understand the criteria

    • Cp
    • Testing Errors
    • BIC
    • K fold Cross Validation
    • LASSO
  • Packages

    • lm(), Anova
    • regsubsets()
    • glmnet() & cv.glmnet()

2 Review materials

  • Study lecture: Model selection
  • Study lecture: Regularization
  • Study lecture: Multiple regression

Review the code and concepts covered during lectures: multiple regression, model selection and penalized regression through elastic net.

3 Homework 2, Case study 3: Auto data set

If you haven’t done this as part of Homework 2, please attach it here.

4 Case study 1: ISLR::Auto data

This will be the last part of our work with the Auto data from ISLR. The original data contains 408 observations about cars. It has some similarity to the Cars data that we use in our lectures. To get the data, first install and load the package ISLR; the data set Auto will then be available. We use this case to go through the methods learned so far.

Final modelling question: We want to explore the effect of each feature on mpg as thoroughly as possible.

  1. Preparing variables:
  • mpg summary
Var1 Freq
Min. 9.0
1st Qu. 17.0
Median 22.8
Mean 23.4
3rd Qu. 29.0
Max. 46.6
  • cylinders summary
Var1 Freq
Min. 3.00
1st Qu. 4.00
Median 4.00
Mean 5.47
3rd Qu. 8.00
Max. 8.00
  • displacement summary
Var1 Freq
Min. 68
1st Qu. 105
Median 151
Mean 194
3rd Qu. 276
Max. 455
  • horsepower summary
Var1 Freq
Min. 46.0
1st Qu. 75.0
Median 93.5
Mean 104.5
3rd Qu. 126.0
Max. 230.0
  • weight summary
Var1 Freq
Min. 1613
1st Qu. 2225
Median 2804
Mean 2978
3rd Qu. 3615
Max. 5140
  • acceleration summary
Var1 Freq
Min. 8.0
1st Qu. 13.8
Median 15.5
Mean 15.5
3rd Qu. 17.0
Max. 24.8
  • year summary

    total 13 years

    from 1970-1982

  • Origin of car

    American: 245

    European: 68

    Japanese: 79

  • Auto names

    unique auto names: 301

4.1 What effect does time have on MPG?

  i. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.
Dependent variable:
mpg
year 1.230***
(1.060, 1.400)
Constant -70.000***
(-83.000, -57.000)
Observations 392
R2 0.337
Adjusted R2 0.335
Residual Std. Error 6.360 (df = 390)
F Statistic 198.000*** (df = 1; 390)
Note: *p<0.1; **p<0.05; ***p<0.01

Year is significant at the 0.01 level (and hence at the .05 level). According to this model, mpg increases by about 1.23, on average, for each additional model year.
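A minimal sketch of the fit behind this table, assuming the ISLR package is installed (Auto already has the rows with missing values removed):

library(ISLR)                       # provides the Auto data set
fit_year <- lm(mpg ~ year, data = Auto)
summary(fit_year)                   # coefficient table, R^2, F statistic
confint(fit_year, level = 0.95)     # 95% CI for the year coefficient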

  ii. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year effect found here. (Table 4)
Dependent variable:
mpg
year 0.657***
(0.527, 0.787)
horsepower -0.132***
(-0.144, -0.119)
Constant -12.700**
(-23.200, -2.250)
Observations 392
R2 0.685
Adjusted R2 0.684
Residual Std. Error 4.390 (df = 389)
F Statistic 424.000*** (df = 2; 389)
Note: *p<0.1; **p<0.05; ***p<0.01

Year is still significant at the 0.01 level. Holding horsepower fixed, mpg increases by about 0.657 for each additional model year. The effect size is smaller than in the previous model because horsepower is now included in the model. (Table 5)

  iii. The two 95% CIs for the coefficient of year differ between (i) and (ii). How would you explain the difference to a non-statistician?

The confidence interval became much narrower going from (i) to (ii). Since we added more information to the model (horsepower), some of the variability we see when examining year alone is explained away, so the estimate of the year effect becomes more precise.

  iv. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at .05 level? Explain the year effect (if any).
Dependent variable:
mpg
year 2.190***
(1.880, 2.510)
horsepower 1.050***
(0.820, 1.270)
year:horsepower -0.016***
(-0.019, -0.013)
Constant -127.000***
(-150.000, -103.000)
Observations 392
R2 0.752
Adjusted R2 0.750
Residual Std. Error 3.900 (df = 388)
F Statistic 393.000*** (df = 3; 388)
Note: *p<0.1; **p<0.05; ***p<0.01

All of the terms, including the interaction, are significant at the 0.01 level. With the interaction in the model, the coefficient 2.190 is no longer the year effect by itself: the estimated change in mpg per additional model year is 2.190 − 0.016 × horsepower, so the year effect is largest for low-horsepower cars and shrinks (eventually becoming negative) as horsepower increases. (Table 6)
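Because the year effect now depends on horsepower, it helps to evaluate it at a few horsepower values. A small sketch (the helper year_effect is ours, and the numbers quoted use the rounded coefficients from the table above):

fit_int <- lm(mpg ~ year * horsepower, data = Auto)
b <- coef(fit_int)
# estimated change in mpg per additional model year for a car with horsepower hp
year_effect <- function(hp) b["year"] + b["year:horsepower"] * hp
year_effect(c(75, 100, 150))
# roughly 0.99, 0.59 and -0.21 mpg per year with the rounded coefficients above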

4.2 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  i. Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?
Dependent variable:
mpg
cylinders -3.560***
(-3.840, -3.270)
Constant 42.900***
(41.300, 44.600)
Observations 392
R2 0.605
Adjusted R2 0.604
Residual Std. Error 4.910 (df = 390)
F Statistic 597.000*** (df = 1; 390)
Note: *p<0.1; **p<0.05; ***p<0.01

Cylinders is significant at the 0.01 level. The model says that each additional cylinder is associated with a decrease of about 3.56 in mpg. (Table 7)

  ii. Fit a model that treats cylinders as a categorical/factor. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.
Dependent variable:
mpg
factor(cylinders)4 8.730***
(4.080, 13.400)
factor(cylinders)5 6.820*
(-0.217, 13.900)
factor(cylinders)6 -0.577
(-5.290, 4.140)
factor(cylinders)8 -5.590**
(-10.300, -0.894)
Constant 20.600***
(15.900, 25.200)
Observations 392
R2 0.641
Adjusted R2 0.638
Residual Std. Error 4.700 (df = 387)
F Statistic 173.000*** (df = 4; 387)
Note: *p<0.1; **p<0.05; ***p<0.01

Only the 4-cylinder level is significant at the 0.01 level (relative to the baseline of 3 cylinders). The estimated effects are not monotone in the number of cylinders: compared with 3-cylinder cars, 4-cylinder cars average about 8.7 mpg more, the 5- and 6-cylinder levels are not significantly different from the baseline, and 8-cylinder cars average about 5.6 mpg less.

  iii. What are the fundamental differences between treating cylinders as a continuous and categorical variable in your models?

Treating cylinders as continuous forces a single slope: every additional cylinder changes the expected mpg by the same fixed amount, so the fit is constrained to be linear (and monotone) in cylinders. Treating cylinders as categorical estimates a separate mean mpg for each cylinder count, which allows a non-linear pattern; for example, 4-cylinder cars have the highest mpg while 6- and 8-cylinder cars drop off.

  iv. Can you test the null hypothesis fit0: mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable, at the .01 level?

Yes, using anova(fit0, fit1), where fit0 regresses mpg on cylinders as a numeric variable and fit1 uses factor(cylinders). The F-test below gives a p-value of about 3.4e-08, so at the .01 level we reject the null hypothesis that mpg is linear in cylinders in favor of the categorical model.

## Analysis of Variance Table
## 
## Model 1: mpg ~ cylinders
## Model 2: mpg ~ factor(cylinders)
##   Res.Df  RSS Df Sum of Sq    F  Pr(>F)    
## 1    390 9416                              
## 2    387 8544  3       871 13.2 3.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
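The fits behind this ANOVA table, as a minimal sketch (assuming Auto is loaded):

fit0 <- lm(mpg ~ cylinders, data = Auto)           # cylinders as a numeric variable
fit1 <- lm(mpg ~ factor(cylinders), data = Auto)   # cylinders as a factor
anova(fit0, fit1)                                  # nested F-test: fit0 is a special case of fit1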
  v. You may explore the possibility of variable transformations. We normally do not suggest transforming \(x\) for the purpose of interpretation. You may consider transforming \(y\) either to correct violations of the linear model assumptions or because a transformation of \(y\) makes more sense from some theory. In this case we suggest you look into GPM = 1/MPG. Compare the residual plots with MPG and GPM as the response and see which one yields a more satisfactory pattern.

In addition, can you provide some background knowledge to support the notion that it makes more sense to model GPM?

GPM (the reciprocal of MPG, often scaled to gallons per 100 miles) measures fuel consumption directly: the number of gallons needed to drive a fixed distance. Both MPG and GPM have a useful role at different points of owning a car. MPG is useful when you are driving a car; GPM is useful when you are purchasing one, because it better captures the fuel consumption, and fuel savings, when comparing a current car to a new car, or two new cars to each other.
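A sketch of the residual-plot comparison; the covariates used here are illustrative and gpm is a variable we create ourselves:

Auto$gpm <- 1 / Auto$mpg                            # gallons per mile = 1/mpg

fit_mpg <- lm(mpg ~ year + horsepower + weight, data = Auto)
fit_gpm <- lm(gpm ~ year + horsepower + weight, data = Auto)

par(mfrow = c(1, 2))
plot(fit_mpg, which = 1, main = "MPG as response")  # residuals vs fitted values
plot(fit_gpm, which = 1, main = "GPM as response")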

  vi. You may also explore by adding interactions and higher order terms. The model(s) should be as parsimonious (simple) as possible, unless the gain in accuracy is significant from your point of view.

Exploring the interaction between horsepower and year.

From the boxplot, we can see that Japanese cars often have significantly higher mpg than American/European cars.

  vii. Use Mallows’ \(C_p\) or BIC to select the model.
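One way to carry out the selection, sketched with leaps::regsubsets; the candidate predictors, and the use of the gpm variable created in the sketch above, are our own choices:

library(leaps)

search <- regsubsets(gpm ~ year + horsepower + weight + displacement + acceleration,
                     data = Auto, nvmax = 5)        # exhaustive search over all subsets
ss <- summary(search)
ss$cp                                               # Mallows' Cp for each model size
ss$bic                                              # BIC for each model size
coef(search, which.min(ss$bic))                     # coefficients of the BIC-optimal model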

Dependent variable:
mpg
year 0.757***
(0.660, 0.854)
weight -0.007***
(-0.007, -0.006)
Constant -14.300***
(-22.200, -6.500)
Observations 392
R2 0.808
Adjusted R2 0.807
Residual Std. Error 3.430 (df = 389)
F Statistic 819.000*** (df = 2; 389)
Note: *p<0.1; **p<0.05; ***p<0.01
Dependent variable:
1/mpg
year -0.001***
(-0.001, -0.001)
horsepower 0.0001***
(0.00005, 0.0001)
weight 0.00001***
(0.00001, 0.00001)
Constant 0.100***
(0.086, 0.113)
Observations 392
R2 0.881
Adjusted R2 0.880
Residual Std. Error 0.006 (df = 388)
F Statistic 957.000*** (df = 3; 388)
Note: *p<0.1; **p<0.05; ***p<0.01

The residual plot for MPG is less evenly distributed vertically, has outliers, and shows more of a pattern than the one for GPM. In the MPG residual plot, one may worry about a violation of linearity as well as the presence of heteroscedasticity. There is more room for improvement in the MPG model.

  viii. Describe the final model and its accuracy. Include diagnostic plots with particular focus on the model residuals.
  • Summarize the effects found. The mpg increases as:
    • year increases
    • weight decreases

    The gpm increases as:
    • year decreases
    • weight increases
    • horsepower increases

  • Predict the mpg of a car that is: built in 1983, made in the US, red, 180 inches long, with 8 cylinders, 350 cubic inches of displacement, 260 horsepower, and a weight of 4,000 pounds. Give a 95% CI.

From these specifications, the mpg model predicts an mpg of 22 with a 95% CI of [15.2, 28.8].

However, since the GPM model is better, we prefer its answer: from these specifications the GPM model predicts an mpg of 15.6 with a 95% CI of [13.1, 19.2].
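A sketch of how such an interval can be obtained from the GPM model in the earlier sketch and converted back to the mpg scale (object names are ours; year is coded as 83 for 1983 in Auto):

new_car <- data.frame(year = 83, horsepower = 260, weight = 4000)

pred_gpm <- predict(fit_gpm, newdata = new_car,
                    interval = "prediction", level = 0.95)
1 / pred_gpm   # back to mpg; the lower/upper bounds swap after taking reciprocals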

  • Any suggestions as to how to improve the quality of the study?

If we could clean the data to take the manufacturer into account, that would improve the study.

5 Case study 2: COVID19

The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 755 million cases have been confirmed worldwide, with nearly 6.8 million associated deaths by Feb of 2023. Within the US alone, there have been over 1.1 million deaths and upwards of 102 million cases reported by Feb of 2023. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.

The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population who die due to the disease.

We assembled this dataset for our research with the goal of investigating the effectiveness of lockdowns in flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.

There are two main goals for this case study.

  1. Show the dynamic evolution of COVID cases and COVID-related deaths at the state level.
  2. Figure out which county-level demographic factors and policy interventions are associated with the mortality rate in the US, by constructing models to find possible factors related to county-level COVID-19 mortality rates.

This is a rather complex project; with our team's help we have made your job easier. Hide all unnecessary lengthy R output and keep your write-up neat and readable.

Remark 1: The data and the statistics reported here were collected before February of 2021.

Remark 2: A group of RAs spent a tremendous amount of time working together to assemble the data. It requires data wrangling skills.

Remark 3: Please keep track of the most updated version of this write-up.

6 Data Summary

The data comes from several different sources:

  1. County-level infection and fatality data - This dataset gives daily cumulative numbers on infection and fatality for each county.
  2. County-level socioeconomic data - The following are the four relevant datasets from this site.
    1. Income - Poverty level and household income.
    2. Jobs - Employment type, rate, and change.
    3. People - Population size, density, education level, race, age, household size, and migration rates.
    4. County Classifications - Type of county (rural or urban on a rural-urban continuum scale).
  3. Intervention Policy Data - This dataset is a manually compiled list of the dates that interventions/lockdown policies were implemented and lifted at the county level.

7 EDA

In this case study, we use the following three nearly cleaned data:

  • covid_county.csv: County-level socioeconomic information that combines the above-mentioned 4 datasets: Income (poverty level and household income), Jobs (employment type, rate, and change), People (population size, density, education level, race, age, household size, and migration rates), and County Classifications
  • covid_rates.csv: Daily cumulative numbers on infection and fatality for each county
  • covid_intervention.csv: County-level lockdown intervention.

Among all data, the unique identifier of county is FIPS.

The cleaning procedure is attached in Appendix 2: Data cleaning. You may go through it if you are interested or would like to make any changes.

It may need more data wrangling.

First read in the data.

# county-level socioeconomic information
county_data <- fread("data/covid_county.csv") 
# county-level COVID case and death
covid_rate <- fread("data/covid_rates.csv")
# county-level lockdown dates 
# covid_intervention <- fread("data/covid_intervention.csv")

7.1 Understand the data

The detailed description of variables is in Appendix 1: Data description. Please get familiar with the variables. Summarize the two data sets briefly.

Race Distribution across US counties

Age Distribution across US counties

Education Distribution across US counties

Employment Distribution across US counties


Income Distribution across US counties

Unemployment Distribution across US counties

COVID-19 Total Cases and Deaths by state and county

7.2 COVID case trend

It is crucial to decide the right granularity for visualization and analysis. We will compare daily vs. weekly total new cases by state, and we will see that daily reports are hard to interpret.

  i. Plot new COVID cases in NY, WA and FL by state and by day. Any irregular patterns? What is the biggest problem with using single-day data?

  ii. Create weekly new cases per 100k, weekly_case_per100k. Plot the spaghetti plots of weekly_case_per100k by state. Use TotalPopEst2019 as the population. (A wrangling sketch is given after this list.)

  iii. Summarize the COVID case trend among states based on the plot in ii). What could be the possible reasons for the variability?

Possible reasons include the timing of lockdowns and of the first reported cases, vaccine access, and governmental attitudes toward COVID protocols.

  iv. (Optional) Use covid_intervention to see whether lockdowns were effective in flattening the curve.
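A sketch of the wrangling behind ii), assuming covid_rate contains State, FIPS, week, cum_cases and TotalPopEst2019 as described above; the intermediate names are ours, and the per-state population is approximated by summing over the counties that report in a given week:

library(dplyr)
library(ggplot2)

weekly_cases <- covid_rate %>%
  group_by(State, FIPS, week) %>%
  summarise(cum_cases = max(cum_cases),                 # cumulative cases at week end, per county
            pop = first(TotalPopEst2019), .groups = "drop") %>%
  group_by(State, week) %>%
  summarise(cum_cases = sum(cum_cases), pop = sum(pop), .groups = "drop") %>%
  group_by(State) %>%
  arrange(week, .by_group = TRUE) %>%
  mutate(weekly_case_per100k = (cum_cases - lag(cum_cases, default = 0)) / pop * 1e5) %>%
  ungroup()

ggplot(weekly_cases, aes(week, weekly_case_per100k, group = State)) +
  geom_line(alpha = 0.4) +
  labs(x = "Week", y = "Weekly new cases per 100k")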

7.3 COVID death trend

  i. For each month in 2020, plot the monthly deaths per 100k as a heatmap by state on the US map. Use the same color range across months. (Hints: set the limits argument in scale_fill_gradient() or use facet_wrap(); use lubridate::month() and lubridate::year() to extract month and year from date; use tidyr::complete(state, month, fill = list(new_case_per100k = NA)) to complete the missing months with no cases.) A plotting sketch is given after this list.

  ii. (Optional) Use plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states? (Hints: Follow Appendix 3: Plotly heatmap in the Module 6 regularization lecture to plot the heatmap using plotly. Use the frame argument in add_trace() for animation. plotly only recognizes abbreviations of state names. Use unique(us_map(regions = "states") %>% select(abbr, full)) to get the abbreviations and merge with the data to get state abbreviations.)
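A sketch of one way to draw the monthly heatmaps in i) with ggplot2::map_data() (which requires the maps package); the monthly aggregation is approximate and all object names are ours:

library(dplyr)
library(lubridate)
library(ggplot2)

monthly_deaths <- covid_rate %>%
  mutate(date = as.Date(date)) %>%
  filter(year(date) == 2020) %>%
  mutate(month = month(date)) %>%
  group_by(State, FIPS, month) %>%
  summarise(deaths = max(cum_deaths) - min(cum_deaths),   # approximate monthly new deaths
            pop = first(TotalPopEst2019), .groups = "drop") %>%
  group_by(State, month) %>%
  summarise(death_per100k = sum(deaths) / sum(pop) * 1e5, .groups = "drop")
# state-months with no reported deaths are absent here; see the tidyr::complete() hint above

state_map <- map_data("state")                            # state polygons with lower-case names
monthly_deaths %>%
  mutate(region = tolower(State)) %>%
  inner_join(state_map, by = "region") %>%                # duplicates polygons once per month
  ggplot(aes(long, lat, group = group, fill = death_per100k)) +
  geom_polygon(color = "grey70") +
  coord_quickmap() +
  scale_fill_gradient(low = "white", high = "darkred") +  # one shared color range across facets
  facet_wrap(~ month)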

8 COVID factor

We now try to build a good parsimonious model to find possible factors related to the death rate at the county level. Let us not take the time series into account for the moment and use the totals as of Feb 1, 2021.

  i. Create the response variable total_death_per100k as the total number of COVID deaths per 100k by Feb 1, 2021. We suggest taking the log transformation log_total_death_per100k = log(total_death_per100k + 1). Merge total_death_per100k into county_data for the following analysis. (See the sketch after this list.)

  ii. Select possible variables in county_data as covariates. We provide county_data_sub, a subset of variables from county_data, to get you started. Please add any potential variables as you wish.

    1. Report missing values in your final subset of variables.

    2. In the following analysis, you may ignore the missing values.

county_data_sub <- county_data %>%
  select(County, State, FIPS, Deep_Pov_All, PovertyAllAgesPct, PerCapitaInc, UnempRate2019, PctEmpFIRE, PctEmpConstruction, PctEmpTrans, PctEmpMining, PctEmpTrade, PctEmpInformation, PctEmpAgriculture, PctEmpManufacturing, PctEmpServices, PopDensity2010, OwnHomePct, Age65AndOlderPct2010, TotalPop25Plus, Under18Pct2010, Ed2HSDiplomaOnlyPct, Ed3SomeCollegePct, Ed4AssocDegreePct, Ed5CollegePlusPct, ForeignBornPct, Net_International_Migration_Rate_2010_2019, NetMigrationRate1019, NaturalChangeRate1019, TotalPopEst2019, WhiteNonHispanicPct2010, NativeAmericanNonHispanicPct2010, BlackNonHispanicPct2010, AsianNonHispanicPct2010, HispanicPct2010, Type_2015_Update, RuralUrbanContinuumCode2013, UrbanInfluenceCode2013, Perpov_1980_0711, HiCreativeClass2000, HiAmenity, Retirement_Destination_2015_Update)
  iii. Use LASSO to choose a parsimonious model with all available sensible county-level information. Force in State in the process. Why do we need to force in State? You may use lambda.1se to choose a smaller model. (See the glmnet sketch after this list.)

  iv. Use Cp or BIC to fine-tune the LASSO model from iii). Again force in State in the process. (You could do backward elimination to avoid using Cp or BIC.)

  v. If necessary, reduce the model from iv) to a final model with all variables being significant at 0.05 level. Are the linear model assumptions all reasonably met?

  vi. It has been shown that COVID affects the elderly the most. It is also claimed that the COVID death rate among African Americans and Latinos is higher. Does your analysis support these arguments?

  vii. Based on your final model, summarize your findings. In particular, summarize the state effect controlling for other variables. Provide intervention recommendations to policy makers to reduce the COVID death rate.

  viii. What else can we do to improve our model? What other important information may we have missed?

  ix. (Optional) Would your findings be very different if you had refined the data in some way or imputed the missing values in part ii)? Check the PCA lecture, section 10, for imputation via softImpute.
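A combined sketch of i) and iii): building log_total_death_per100k and running LASSO with State forced in via a zero penalty factor. The cutoff handling and object names are ours, and FIPS may need to be coerced to a common type before joining:

library(dplyr)
library(glmnet)

# i) total deaths per 100k by Feb 1, 2021, then merged into the county data
death_feb1 <- covid_rate %>%
  mutate(date = as.Date(date)) %>%
  filter(date <= as.Date("2021-02-01")) %>%
  group_by(FIPS) %>%
  summarise(total_death_per100k =
              max(cum_deaths) / first(TotalPopEst2019) * 1e5, .groups = "drop")

dat <- county_data_sub %>%
  inner_join(death_feb1, by = "FIPS") %>%
  mutate(log_total_death_per100k = log(total_death_per100k + 1)) %>%
  select(-County, -FIPS, -total_death_per100k) %>%
  na.omit()                                    # ignore missing values for this analysis

# iii) design matrix with State dummies; penalty.factor = 0 leaves State unpenalized
X <- model.matrix(log_total_death_per100k ~ ., data = dat)[, -1]
y <- dat$log_total_death_per100k
pf <- ifelse(startsWith(colnames(X), "State"), 0, 1)

set.seed(1)
cv_fit <- cv.glmnet(X, y, alpha = 1, penalty.factor = pf)
coef(cv_fit, s = "lambda.1se")                 # parsimonious LASSO fit with State forced in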

9 Executive summary

Please summarize this project as follows (no more than one page):

  • Goal of the study

  • Data

    • Source and a brief description of the data
    • How the data sets were assembled (mostly done by our team, but you may present this as if you had done it yourself)
  • Analyses

  • Methods used

  • Findings

  • Limitations

10 Appendix 1: Data description

A detailed summary of the variables in each data set follows:

10.1 Infection and fatality data

  • date: Date
  • county: County name
  • state: State name
  • fips: County code that uniquely identifies a county
  • cases: Number of cumulative COVID-19 infections
  • deaths: Number of cumulative COVID-19 deaths

10.2 Socioeconomic demographics

Income: Poverty level and household income

  • PovertyUnder18Pct: Poverty rate for children age 0-17, 2018
  • Deep_Pov_All: Deep poverty, 2014-18
  • Deep_Pov_Children: Deep poverty for children, 2014-18
  • PovertyAllAgesPct: Poverty rate, 2018
  • MedHHInc: Median household income, 2018 (In 2018 dollars)
  • PerCapitaInc: Per capita income in the past 12 months (In 2018 inflation adjusted dollars), 2014-18
  • PovertyAllAgesNum: Number of people of all ages in poverty, 2018
  • PovertyUnder18Num: Number of people age 0-17 in poverty, 2018

Jobs: Employment type, rate, and change

  • UnempRate2007-2019: Unemployment rate, 2007-2019

  • NumEmployed2007-2019: Employed, 2007-2019

  • NumUnemployed2007-2019: Unemployed, 2007-2019

  • PctEmpChange1019: Percent employment change, 2010-19

  • PctEmpChange1819: Percent employment change, 2018-19

  • PctEmpChange0719: Percent employment change, 2007-19

  • PctEmpChange0710: Percent employment change, 2007-10

  • NumCivEmployed: Civilian employed population 16 years and over, 2014-18

  • NumCivLaborforce2007-2019: Civilian labor force, 2007-2019

  • PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18

  • PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18

  • PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18

  • PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18

  • PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18

  • PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18

  • PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18

  • PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18

  • PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18

  • PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18

People: Population size, density, education level, race, age, household size, and migration rates

  • PopDensity2010: Population density, 2010

  • LandAreaSQMiles2010: Land area in square miles, 2010

  • TotalHH: Total number of households, 2014-18

  • TotalOccHU: Total number of occupied housing units, 2014-18

  • AvgHHSize: Average household size, 2014-18

  • OwnHomeNum: Number of owner occupied housing units, 2014-18

  • OwnHomePct: Percent of owner occupied housing units, 2014-18

  • NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18

  • HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18

  • FemaleHHPct: Percent of female headed family households of total households, 2014-18

  • FemaleHHNum: Number of female headed family households, 2014-18

  • NonEnglishHHNum: Number of non-English speaking households, 2014-18

  • HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18

  • Age65AndOlderPct2010: Percent of population 65 or older, 2010

  • Age65AndOlderNum2010: Population 65 years or older, 2010

  • TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average

  • Under18Pct2010: Percent of population under age 18, 2010

  • Under18Num2010: Population under age 18, 2010

  • Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18

  • Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18

  • Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18

  • Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18

  • Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18

  • Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18

  • Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18

  • ForeignBornPct: Percent of total population foreign born, 2014-18

  • ForeignBornEuropePct: Percent of persons born in Europe, 2014-18

  • ForeignBornMexPct: Percent of persons born in Mexico, 2014-18

  • ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18

  • ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18

  • ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18

  • ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18

  • ForeignBornNum: Number of people foreign born, 2014-18

  • ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18

  • ForeignBornEuropeNum: Number of persons born in Europe, 2014-18

  • ForeignBornMexNum: Number of persons born in Mexico, 2014-18

  • ForeignBornAfricaNum: Number of persons born in Africa, 2014-18

  • ForeignBornAsiaNum: Number of persons born in Asia, 2014-18

  • ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18

  • Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19

  • Net_International_Migration_2010_2019: Net international migration, 2010-19

  • Net_International_Migration_2000_2010: Net international migration, 2000-10

  • Immigration_Rate_2000_2010: Net international migration rate, 2000-10

  • NetMigrationRate0010: Net migration rate, 2000-10

  • NetMigrationRate1019: Net migration rate, 2010-19

  • NetMigrationNum0010: Net migration, 2000-10

  • NetMigration1019: Net Migration, 2010-19

  • NaturalChangeRate1019: Natural population change rate, 2010-19

  • NaturalChangeRate0010: Natural population change rate, 2000-10

  • NaturalChangeNum0010: Natural change, 2000-10

  • NaturalChange1019: Natural population change, 2010-19

  • TotalPop2010: Population size 4/1/2010 Census

  • TotalPopEst2010: Population size 7/1/2010

  • TotalPopEst2011: Population size 7/1/2011

  • TotalPopEst2012: Population size 7/1/2012

  • TotalPopEst2013: Population size 7/1/2013

  • TotalPopEst2014: Population size 7/1/2014

  • TotalPopEst2015: Population size 7/1/2015

  • TotalPopEst2016: Population size 7/1/2016

  • TotalPopEst2017: Population size 7/1/2017

  • TotalPopEst2018: Population size 7/1/2018

  • TotalPopEst2019: Population size 7/1/2019

  • TotalPopACS: Total population, 2014-18 - 5-year average

  • TotalPopEstBase2010: County Population estimate base 4/1/2010

  • NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10

  • PopChangeRate1819: Population change rate, 2018-19

  • PopChangeRate1019: Population change rate, 2010-19

  • PopChangeRate0010: Population change rate, 2000-10

  • NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10

  • HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10

  • MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10

  • NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10

  • NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10

  • MultipleRacePct2010: Percent multiple race, 2010

  • WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010

  • NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010

  • BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010

  • AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010

  • HispanicPct2010: Percent Hispanic, 2010

  • MultipleRaceNum2010: Population size multiple race, 2010

  • WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010

  • BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010

  • NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010

  • AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010

  • HispanicNum2010: Population size Hispanic, 2010

10.3 County classifications

Type of county (rural or urban on a rural-urban continuum scale)

  • Type_2015_Recreation_NO: Recreation counties, 2015 edition

  • Type_2015_Farming_NO: Farming-dependent counties, 2015 edition

  • Type_2015_Mining_NO: Mining-dependent counties, 2015 edition

  • Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition

  • Type_2015_Update: County typology economic types, 2015 edition

  • Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition

  • Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition

  • RecreationDependent2000: Nonmetro recreation-dependent, 1997-00

  • ManufacturingDependent2000: Manufacturing-dependent, 1998-00

  • FarmDependent2003: Farm-dependent, 1998-00

  • EconomicDependence2000: Economic dependence, 1998-00

  • RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003

  • UrbanInfluenceCode2003: Urban influence code, 2003

  • RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013

  • UrbanInfluenceCode2013: Urban influence code, 2013

  • Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013

  • Micropolitan2013: Micropolitan, 2013

  • Nonmetro2013: Nonmetro, 2013

  • Metro2013: Metro, 2013

  • Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013

  • Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003

  • Micropolitan2003: Micropolitan, 2003

  • Metro2003: Metro, 2003

  • Nonmetro2003: Nonmetro, 2003

  • NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003

  • NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003

  • Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11

  • Gas_Change: Change in the value of onshore natural gas production, 2000-11

  • Oil_Change: Change in the value of onshore oil production, 2000-11

  • Hipov: High poverty counties, 2014-18

  • Perpov_1980_0711: Persistent poverty counties, 2015 edition

  • PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition

  • PersistentChildPoverty2004: Persistent child poverty counties, 2004

  • PersistentPoverty2000: Persistent poverty counties, 2004

  • Low_Education_2015_update: Low education counties, 2015 edition

  • LowEducation2000: Low education, 2000

  • HiCreativeClass2000: Creative class, 2000

  • HiAmenity: High natural amenities

  • RetirementDestination2000: Retirement destination, 1990-00

  • Low_Employment_2015_update: Low employment counties, 2015 edition

  • Population_loss_2015_update: Population loss counties, 2015 edition

  • Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition

11 Appendix 2: Data cleaning

The raw data sets are dirty and need transforming before we can do our EDA. It takes time and effort to clean and merge the different data sources, so we provide the final output of the cleaned and merged data. The cleaning procedure is as follows; please read through it to understand what is in the cleaned data. We set eval = data_cleaned in the following cleaning chunks so that they only run if any of data/covid_county.csv, data/covid_rates.csv or data/covid_intervention.csv does not exist.

# TRUE if any of the cleaned data files is missing, i.e. the cleaning chunks need to run
data_cleaned <- !(file.exists("data/covid_county.csv") & 
                    file.exists("data/covid_rates.csv") & 
                    file.exists("data/covid_intervention.csv"))

We first read in the tables using data.table::fread(), as we did last time.

# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))

# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))

# Intervention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))

11.1 Clean NYC data

The original NYC data contains more information than we need. We extract only the number of cases and deaths and format it the same as the covid_rates data.

# NYC county fips matching table 
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
                       County = c("BX", "BK", "MN", "QN", "SI"))

# nyc case
nyc_case <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_CASE_COUNT, 
                   BK = BK_CASE_COUNT, 
                   MN = MN_CASE_COUNT, 
                   QN = QN_CASE_COUNT, 
                   SI = SI_CASE_COUNT)]

nyc_case %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "cases") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_cases = cumsum(cases))

# nyc death
nyc_death <- nyc[,.(date = as.Date(date_of_interest, "%m/%d/%Y"),
                   BX = BX_DEATH_COUNT, 
                   BK = BK_DEATH_COUNT, 
                   MN = MN_DEATH_COUNT, 
                   QN = QN_DEATH_COUNT, 
                   SI = SI_DEATH_COUNT)]

nyc_death %<>% 
  pivot_longer(cols = BX:SI, 
               names_to = "County",
               values_to = "deaths") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_deaths = cumsum(deaths))

nyc_rates <- merge(nyc_case, 
                   nyc_death,
                   by = c("date", "County"),
                   all.x= T)

nyc_rates <- merge(nyc_rates,
                   nyc_fips,
                   by = "County")

nyc_rates$State <- "New York"
nyc_rates %<>% 
  select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
  arrange(FIPS, date)

11.2 Continental US cases

We only consider cases and deaths in the continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72, respectively, as the first 2 digits of their FIPS codes. We use the %/% operator for integer division to get the first 2 digits of FIPS. We also remove the Virgin Islands and Northern Mariana Islands. All counties in NYC are aggregated as County == "New York City" in covid_rates with no FIPS, so we combine the NYC data into covid_rates.

covid_rates <- covid_rates %>% 
  arrange(fips, date) %>%
  filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
  filter(county != "New York City") %>%
  filter(!(state %in% c("Virgin Islands", "Northern Mariana Islands"))) %>%
  rename(FIPS = "fips",
         County = "county",
         State = "state",
         cum_cases = "cases", 
         cum_deaths = "deaths")


covid_rates$date <- as.Date(covid_rates$date)

covid_rates <- rbind(covid_rates, 
                     nyc_rates)

11.3 COVID date to week

We set the week of Jan 21, 2020 (the first COVID case in the US) as the first week (2020-01-19 to 2020-01-25).

covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]

11.4 COVID infection/mortality rates

Merge the TotalPopEst2019 variable from the demographic data with covid_rates by FIPS.

covid_rates <- merge(covid_rates[!is.na(FIPS)], 
                     people[,.(FIPS = as.character(FIPS),
                                   TotalPopEst2019)],
                     by = "FIPS",  
                     all.x = TRUE)

11.5 NA in COVID data

NA values in the covid_rates data set correspond to a county not having confirmed cases/deaths. We replace the NA values in these columns with zeros. FIPS codes for Kansas City, Missouri, Rhode Island, and some others are missing. We drop them for the moment and output the data up to week 57 as covid_rates.csv.

covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0
fwrite(covid_rates %>% 
         filter(week < 58) %>% 
         arrange(FIPS, date), 
       "data/covid_rates.csv")

11.6 Formatting date in int_dates

We convert the columns representing dates in int_dates to R Date types using as.Date(). We will need to specify that the origin parameter is "0001-01-01". We output the data as covid_intervention.csv.

int_dates <- int_dates[-1,]
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"), 
          .SDcols = date_cols]

fwrite(int_dates, "data/covid_intervention.csv")

11.7 Merge demographic data

Merge the demographic data sets by FIPS and output as covid_county.csv.

countydata <-
  merge(x = income,
        y = merge(
          x = people,
          y = jobs,
          by = c("FIPS", "State", "County")),
        by = c("FIPS", "State", "County"),
        all = TRUE)


countydata <- 
  merge(
    x = countydata,
    y = county_class %>% rename(FIPS = FIPStxt), 
    by = c("FIPS", "State", "County"),
    all = TRUE
  )

# Check dimensions
# They are now 3279 x 208
dim(countydata)
fwrite(countydata, "data/covid_county.csv")